For a large class of applications, the standard RAG architecture is illegal. Not slow, not expensive, illegal. The moment you embed a user's medical history, a company's proprietary code, or regulated PII and ship those vectors to a cloud vector database, you have moved sensitive data across a boundary that GDPR, HIPAA, or an enterprise data-residency policy says it cannot cross. Embeddings are not anonymisation; they are a lossy but reversible-enough representation of the source.
The usual answers are to sign a business associate agreement, self-host the vector DB in a compliant region, or scrub the data first. Sometimes those work. Often the only answer the compliance team will accept is the strongest one: the data never leaves the device at all. That is the architecture I built for a privacy-first wellness app, and here is how on-device RAG works without crippling the phone.
1. The privacy wall, precisely
The compliance objection is not vague nervousness; it is specific. Cloud RAG creates at least three crossings of the trust boundary: the raw text leaves the device to be embedded (unless you embed locally), the vectors are stored on third-party infrastructure, and the retrieved context is sent to a hosted LLM. Each crossing is a place an auditor will stop you. Zero-cloud means collapsing all three: embed on device, store on device, and either generate on device or send out only the minimum context the user has consented to share.
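The last option, sending out only minimal, consented context, can be made concrete with a small gate in front of any hosted LLM call. The sketch below is illustrative only: `Chunk`, `selectOutboundContext`, the consent set, and the character budget are hypothetical names and are not taken from the app.

```typescript
// Hypothetical shape of a retrieved chunk; not the app's real schema.
interface Chunk {
  id: string;
  text: string;
  score: number; // retrieval score, higher means more relevant
}

// Returns the chunks that may leave the device: consented ones only,
// most relevant first, within a total character budget.
function selectOutboundContext(
  chunks: Chunk[],
  consentedIds: ReadonlySet<string>,
  maxChars: number,
): Chunk[] {
  const selected: Chunk[] = [];
  let used = 0;
  // Copy before sorting so the caller's array is not mutated.
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  for (const chunk of ranked) {
    if (!consentedIds.has(chunk.id)) continue; // no consent: stays on device
    if (used + chunk.text.length > maxChars) continue; // would exceed the budget
    selected.push(chunk);
    used += chunk.text.length;
  }
  return selected;
}
```

Anything the function does not return never crosses the boundary, which gives an auditor a single place to inspect.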
2. Encryption at rest, anchored in hardware
The foundation is an encrypted database. SQLCipher wraps SQLite with transparent AES-256 encryption: every page is encrypted on disk, and the database is unreadable without the key. The discipline that makes this real rather than theatre is where the key lives. It is never hard-coded and never stored in plain application storage. It lives in the platform's hardware-backed keystore (the iOS Keychain, whose items are protected by keys held in the Secure Enclave, and the Android Keystore), reached through expo-secure-store. With requireAuthentication set, it is released to the app only after device authentication.
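import * as SecureStore from 'expo-secure-store';
// The key is generated once on first run, sealed in the hardware-backed keystore,
// and never written to ordinary storage or sent anywhere.
const key = await SecureStore.getItemAsync('db_key', {
  requireAuthentication: true, // require device authentication to read the value
});
if (key === null) {
  // Never fall through: interpolating null below would set a bogus key.
  throw new Error('Database key missing: first-run key provisioning has not run.');
}
// `db` is an already-opened SQLCipher connection. The key must be applied before
// any other statement; it is assumed here to be 64 hex characters (a raw 256-bit key).
// SQLCipher applies the key; every page is AES-256 encrypted at rest.
db.exec(`PRAGMA key = "x'${key}'"`);
db.exec(`PRAGMA cipher_memory_security = ON`);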
If the device is lost, the database on disk is ciphertext and the key is sealed in hardware that resists extraction. That is the property the compliance team actually wants: not "we trust our servers," but "the data is unreadable without this specific authenticated device."
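The snippet above reads a key that must already exist. A minimal sketch of the first-run side, under stated assumptions, might look like the following. The helper names (`bytesToHex`, `generateDatabaseKeyHex`, `buildKeyPragma`) are hypothetical, and the random source is injected so the sketch stays platform-neutral; on Expo it could be backed by expo-crypto's `getRandomBytes`.

```typescript
// Source of cryptographically secure random bytes, supplied by the platform.
type RandomBytes = (byteCount: number) => Uint8Array;

// Encodes bytes as lowercase hex, two characters per byte.
function bytesToHex(bytes: Uint8Array): string {
  let hex = '';
  for (let i = 0; i < bytes.length; i++) {
    hex += ('0' + bytes[i].toString(16)).slice(-2);
  }
  return hex;
}

// 32 random bytes give a raw 256-bit key, i.e. 64 hex characters.
function generateDatabaseKeyHex(randomBytes: RandomBytes): string {
  return bytesToHex(randomBytes(32));
}

// Builds the raw-key PRAGMA, refusing anything that is not exactly 64 hex
// characters so that no unexpected text is interpolated into SQL.
function buildKeyPragma(hexKey: string): string {
  if (!/^[0-9a-fA-F]{64}$/.test(hexKey)) {
    throw new Error('Expected a 64-character hex key');
  }
  return `PRAGMA key = "x'${hexKey}'";`;
}
```

On first launch the generated hex string would be written once with `SecureStore.setItemAsync` and never logged. Validating the key before building the PRAGMA also catches a missing or corrupted stored value early.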
3. Two retrievers, both local
On-device retrieval has to do the same job as cloud hybrid search (catch exact terms and semantic matches) without a server. The answer is the same two-retriever shape, implemented with what the phone already has.
Lexical: SQLite FTS5. SQLite ships a full-text search engine, FTS5, that indexes content for fast keyword and phrase queries entirely in-process. Its built-in rank column orders results by BM25 by default, so it does the job a server-side BM25 index does in cloud hybrid search, and it is essentially free because the database is already SQLite:
CREATE VIRTUAL TABLE notes_fts USING fts5(content, tokenize = 'porter');
-- Fast, local, no network: ranked keyword retrieval over encrypted content.
SELECT rowid, rank FROM notes_fts
WHERE notes_fts MATCH ?
ORDER BY rank
LIMIT 50;
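The `?` parameter is parsed as an FTS5 query expression, not as plain text, so raw user input containing quotes or words such as AND, OR, or NOT can fail or change meaning. One common approach, sketched here with a hypothetical helper name `toFtsMatchQuery`, is to quote each whitespace-separated term so that FTS5 treats it as a literal string:

```typescript
// Turns free-form user input into an FTS5 query in which every term is a
// quoted string. FTS5 joins adjacent strings with an implicit AND.
// Returns null for blank input so the caller can skip the query entirely.
function toFtsMatchQuery(userInput: string): string | null {
  const terms = userInput
    .trim()
    .split(/\s+/)
    .filter((term) => term.length > 0);
  if (terms.length === 0) return null;
  // Inside an FTS5 string, a double quote is escaped by doubling it.
  return terms.map((term) => `"${term.replace(/"/g, '""')}"`).join(' ');
}
```

For example, `toFtsMatchQuery('blood pressure')` yields `"blood" "pressure"`, which is then bound to the `MATCH ?` parameter. Whether to use AND or OR semantics between terms is a product decision; this sketch only shows the escaping.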
Semantic: a quantized embedding model. The semantic half runs a small, quantized embedding model on-device. Quantization (int8 or 4-bit weights instead of float32) shrinks the model enough to load and run on a phone without draining the battery or stalling the UI, at a modest and acceptable cost to embedding quality. Each note is embedded once at write time; the vectors are stored in the same encrypted database and compared with cosine similarity at query time. No embedding API call, no vector-DB round trip.
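At personal scale, the query-time comparison can be a plain linear scan. The following is a minimal sketch, assuming vectors have already been read from the database into `Float32Array` values; `StoredVector`, `cosineSimilarity`, and `topKBySimilarity` are illustrative names, not the app's code.

```typescript
// Hypothetical in-memory form of one stored embedding.
interface StoredVector {
  id: number; // rowid of the note
  vector: Float32Array;
}

// Cosine similarity: dot product divided by the product of the norms.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error('Dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator; // zero vectors score 0
}

// Brute-force scan: score every stored vector, keep the k best.
function topKBySimilarity(
  query: Float32Array,
  rows: StoredVector[],
  k: number,
): Array<{ id: number; score: number }> {
  return rows
    .map((row) => ({ id: row.id, score: cosineSimilarity(query, row.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A scan like this is linear in the number of notes, which is usually acceptable for one user's corpus. If the vectors are normalised at write time, the similarity reduces to the dot product.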
Fuse the two ranked lists the same way cloud RAG does, with Reciprocal Rank Fusion, and you have on-device hybrid retrieval: exact-term recall from FTS5, semantic recall from local embeddings, combined without either touching the network.
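Reciprocal Rank Fusion needs only the rank positions, not the raw scores, which is convenient because BM25 and cosine similarity are on different scales. Each document receives 1 / (k + rank) from every list it appears in, and the contributions are summed. The sketch below uses k = 60, a commonly used default rather than a value stated for this app, and the function name is illustrative.

```typescript
// Fuses several ranked lists of document ids (best first) into one ranking.
function reciprocalRankFusion(
  rankedLists: number[][],
  k: number = 60,
): Array<{ id: number; score: number }> {
  const scores = new Map<number, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: Array<{ id: number; score: number }> = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  return fused.sort((a, b) => b.score - a.score);
}
```

Called as `reciprocalRankFusion([ftsIds, semanticIds])`, it rewards documents that rank well in both lists while still keeping documents that only one retriever found.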
4. The tradeoffs, stated honestly
On-device RAG is not a free lunch, and pretending otherwise is how you ship something that dies in the App Store reviews:
- Model quality. A quantized phone-sized embedding model is weaker than a hosted large one. For personal-scale corpora (a user's own notes, not a million documents) the quality is more than enough; for large, nuanced corpora it is a real ceiling.
- First-run cost. Loading the model and embedding an existing corpus on first launch costs time and battery. You do it once, in the background, with a progress state, not on the hot path.
- Storage. Vectors, the FTS5 index, and the model all consume device storage. For personal data this is megabytes, not gigabytes, but it is not zero.
- No cross-device sync of plaintext. The whole point is that data stays on the device, which means sync requires end-to-end encryption, not a shared cloud index. That is a feature for privacy and a constraint for product design.
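The storage point can be checked with back-of-the-envelope arithmetic. The figures below are assumptions chosen for illustration (10,000 notes, 384-dimensional float32 vectors), not measurements from the app, and they cover raw vector data only, excluding the FTS5 index, the model file, and SQLite overhead.

```typescript
// Raw size of the stored vectors: one component per dimension per note.
function vectorStorageBytes(
  noteCount: number,
  dimensions: number,
  bytesPerComponent: number,
): number {
  return noteCount * dimensions * bytesPerComponent;
}

// Assumed corpus: 10,000 notes, 384 dimensions, 4 bytes per float32 component.
const float32Bytes = vectorStorageBytes(10000, 384, 4); // 15,360,000 bytes
const float32Megabytes = float32Bytes / 1e6; // 15.36 MB
```

Under those assumptions the vectors take roughly 15 MB, which is consistent with the "megabytes, not gigabytes" claim for a single user's corpus.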
5. The stack
The implementation is a Next.js 15 web app and an Expo 54 React Native mobile app sharing logic, with SQLCipher for the encrypted store, expo-secure-store for hardware-backed key custody, SQLite FTS5 for lexical retrieval, and a quantized on-device embedding model for semantic retrieval. It is the same hybrid-plus-fusion retrieval shape I use in my cloud RAG engine, rebuilt to run entirely within the trust boundary of a single authenticated phone.
What I Built
This architecture powers the privacy-first retrieval in WellnessInYou, a full-stack wellness platform where user data is personal by definition and on-device by design. It is the same conviction behind the rest of my work: the impressive part of an AI system is rarely the model, it is the engineering around it that makes the system trustworthy enough to put in front of real users with real data.